1. Introduction


In this paper we review some of the recent work on nonparametric
and sequential rules in statistical pattern recognition, along with
criticisms, and indicate some new results in these areas. Summary
reviews of the literature are given by Das Gupta [10] and Kanal [22].

The basic problem in statistical pattern recognition can be
formulated as follows. Let $\omega$ be a point in a sample space with an
associated $\sigma$-field of events and a probability measure. Let $X(\omega)$
denote a real-valued vector of measurements on $\omega$, and let $I(\omega)$
denote the pattern-class of $\omega$, which takes values in $\{1, 2, \ldots, k\}$.
The problem is to predict $I(\omega)$ from the knowledge of $X(\omega)$. Denote
the pattern-class probabilities by $\xi = (\xi_1, \ldots, \xi_k)$, where
$\xi_i = \Pr(I = i)$, and the class distributions by $F = (F_1, \ldots, F_k)$,
where $F_i$ is the conditional distribution of $X$, given $I = i$; let
$F_i$ admit a density $f_i$ with respect to a $\sigma$-finite measure $\mu$.

The above problem can arise in different forms, depending on the
situation and the structure of the available knowledge, as discussed below.

The problem may be to classify a single unit or more than one unit,
which may occur in a single batch or in a sequence. Moreover, the
units to be classified may belong to the same pattern-class (i.e., when
units are sampled from the space conditioned on some value of $I$), or
to different pattern-classes. In almost all the problems, $F$ and $\xi$
are not completely known, although it may be known that $F_1 \times \cdots \times F_k$
belongs to a given set $\Omega$. In order to get more information on $F$ and
$\xi$, data called a "training sample" are collected in one of the
following ways.

(a) Separate samples from the $k$ pattern-class populations, denoted
by $\pi_1, \ldots, \pi_k$. In this case $\xi$ is interpreted as a prior
probability vector.

(b) A sample from the mixed population (denoted by $\pi$), which is a
mixture of $\pi_1, \ldots, \pi_k$ in the proportion $\xi_1 : \cdots : \xi_k$.

Sometimes, when the units to be classified occur in a sequence,
observations on the first $i$ units are taken as a training sample to
predict the pattern-class of the $(i+1)$th unit.

A training sample may be identified, i.e., the observations are
available on both $X$ and $I$, or unidentified, when the observations are
available only on $X$.

The so-called nonparametric methods arise when $\Omega$ is not given
explicitly in simple parametric forms. It is well known that there
is no distribution-free rule for this problem. So the performance of
a rule cannot be evaluated (except for some asymptotic results and broad
bounds) without additional knowledge of the underlying $\Omega$. A Bayes rule
can be easily derived when $\xi$ and the $f_i$'s are known. The bulk of
the literature is devoted to plug-in versions of a Bayes rule, i.e.,
rules in which the unknowns in the form of the Bayes rule are replaced by
their respective estimates (derived from the training sample). For this
nonparametric problem, estimates of the $f_i$'s and $\xi$ are generally used;
asymptotic properties of such rules are then easily derived from the
asymptotic properties of the estimates used. Rules based on tolerance
regions and nearest neighbors are also discussed in the literature; some
of these rules indirectly use estimates of density functions. Another class
of rules is suggested by the nonparametric methods for the two-sample
problem; these rules are based on general U-statistics, estimates of
c.d.f.'s, ranks, and permutation-invariance. A class of rules called
"empirical best-of-class rules" is also under study; these rules are
optimum in some sense when they are applied to the identified training
sample.

Sequential rules also arise in different situations. There can be
rules based on sequential experimentation on the components of $X(\omega)$,
although there is no result in this area worth mentioning (except
possibly some heuristic methods). Next, a training sample may be
obtained sequentially and rules may be devised based on such sequential
experiments. Furthermore, when the units to be classified belong to
the same pattern-class, they may also be observed through a sequential
experiment. It may be noted that when the units to be classified occur
in a sequence from $\pi$, a sequential rule may be devised, although
sequential experimentation in such a case is not meaningful. All the
papers in this area deal with direct applications of sequential two-sample
tests. Unfortunately, very little has been done so far.

Although there are some papers reporting Monte Carlo studies of the
performance of some rules, robustness and relative efficiency have to
be studied much more intensively and carefully. Asymptotic results are of
theoretical interest; however, good bounds on error probabilities and
studies of the errors of approximation will be more valuable.

2.  Notations  and  Preliminaries. 

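Let us first restrict our attention to the case $k = 2$ and the
situation where the pattern class of one unit is to be predicted. A
decision rule, not depending on the training sample, is given by
$\delta = (\delta_1, \delta_2)$, where $\delta_i(x)$ is the probability of deciding $i$ as the
correct pattern class, given the observation $X = x$. The probabilities
of error for the rule $\delta$ are given by

$$\alpha_1(\delta; f) = \int \delta_2 f_1 \, d\mu, \qquad \alpha_2(\delta; f) = \int \delta_1 f_2 \, d\mu,$$

where $f = (f_1, f_2)$. The risk with the prior distribution $\xi$ and the 0-1 loss
function (or, the total probability of error) is given by

$$R(\delta; \xi, f) = \xi_1 \alpha_1(\delta; f) + \xi_2 \alpha_2(\delta; f).$$

Given $\xi$ and $f$, a Bayes rule $\delta^*(\cdot\,; \xi, f)$, which minimizes $R(\delta; \xi, f)$, is
given by

$$\delta_1^*(x; \xi, f) = \begin{cases} 1, & \text{if } \xi_1 f_1(x) > \xi_2 f_2(x), \\ 0, & \text{if } \xi_1 f_1(x) < \xi_2 f_2(x). \end{cases}$$

Let $R^*(\xi, f) = R(\delta^*(\cdot\,; \xi, f); \xi, f)$.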
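As an illustration of the Bayes rule for $k = 2$, the following minimal sketch evaluates the rule and approximates its risk numerically, using the standard identity that the Bayes risk equals the integral of $\min(\xi_1 f_1, \xi_2 f_2)$. The normal densities, the function names, the integration limits and the tie-breaking convention (ties go to class 1) are assumptions made only for this example; they are not part of the paper.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2); an illustrative choice of class density f_i."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_rule(x, xi, f1, f2):
    """Decide class 1 if xi_1 f_1(x) >= xi_2 f_2(x), else class 2.

    Ties are assigned to class 1 here; any assignment on the tie set
    gives the same risk.
    """
    return 1 if xi[0] * f1(x) >= xi[1] * f2(x) else 2

def bayes_risk(xi, f1, f2, lo=-10.0, hi=10.0, steps=20000):
    """Approximate R* = integral of min(xi_1 f_1, xi_2 f_2) by the midpoint rule.

    The limits lo/hi are assumed wide enough to cover both densities.
    """
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += min(xi[0] * f1(x), xi[1] * f2(x))
    return total * h
```

For two unit-variance normal densities with means 0 and 2 and equal priors, the rule cuts at $x = 1$ and the computed risk should be close to $\Phi(-1)$.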

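Let $\theta$ stand for $(\xi, f)$ or $(\xi, F)$. When $\theta$ is known, a "good" rule
generally depends on $\theta$, and it is denoted by $\delta(\cdot\,; \theta)$; $\delta^*$ is such a
rule. We shall drop $\xi$ from $\theta$ when the population is not mixed and $\xi$
is not given. When $\delta$ explicitly depends on $\theta$, we shall write $\alpha_i(\delta; f)$
as $\alpha_i(\delta; \theta)$.

Now consider the problem when $\theta$ is not completely known. Information
on $\theta$ is based on a training sample $S_N$; in case the sampling is done
separately from the $\pi_i$'s, $N$ will stand for the vector of sample sizes. A
decision rule, in that case, is denoted by
$\delta_N(\cdot\,; S_N) = (\delta_{N1}(\cdot\,; S_N), \delta_{N2}(\cdot\,; S_N))$.
In particular, such a rule may be a plug-in version of $\delta(\cdot\,; \theta)$, given by
$\delta(\cdot\,; \hat\theta_N)$, where $\hat\theta_N$ is an estimate of $\theta$ based on $S_N$. Instead
of $\delta_N(\cdot\,; S_N)$ we shall often write $\delta_N$.

The conditional probabilities of error and the conditional risk of
$\delta_N$, given $S_N$, are given by

$$\alpha_{ic}(\delta_N; S_N, \theta) = \int \delta_{Nj}(x; S_N) f_i(x) \, d\mu(x) \quad (i \neq j), \quad \text{and}$$

$$R_c(\delta_N; S_N, \theta) = \sum_{i=1}^{2} \xi_i \, \alpha_{ic}(\delta_N; S_N, \theta).$$

The unconditional probabilities of error and the risk of $\delta_N$ are given by

$$\alpha_i(\delta_N; \theta) = E\, \alpha_{ic}(\delta_N; S_N, \theta), \qquad R(\delta_N; \theta) = E\, R_c(\delta_N; S_N, \theta),$$

where $E$ denotes the expectation over $S_N$.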

When there is more than one unit to be classified, and it is known that
they come from the same population, the above development can easily be
extended. However, when the units occur in a sequence one may adopt
Samuel's approach, although no results are available in the literature when
the densities are not known. When the units to be classified arise from the
mixed population one may use the compound decision approach as suggested by
Robbins [34] and later developed by Hannan and Robbins [16], Samuel [36,37]
and Yau [53]; however, all these papers assume that the class densities are
known. See Van Ryzin [43] for a similar development when the distributions are
unknown. When the units to be classified arise in a sequence from $\pi$, and for
classifying the $i$th unit the observations on all the previous identified (by a
"teacher") units are used as a training sample, a completely separate theoretical
development would be called for. However, all the papers dealing with this
problem put emphasis only on the prediction of the class of the $i$th unit using the
standard theory discussed above.

3. Nonparametric Rules.

3.1. A Simple Approach.

Most of the papers deal with asymptotic properties of rules based on a
training sample as $N \to \infty$. In particular, the convergences
(and their rates) of conditional, as well as unconditional, probabilities
of error and risk are dealt with. Special emphasis is given
to the rule $\delta^*(\cdot\,; \hat\theta_N)$, the plug-in version of a Bayes rule $\delta^*$. Bounds on
probabilities of correct classification for some heuristic rules are
also available in some papers. The above convergences as $N \to \infty$ and
the number of units to be classified tends to $\infty$ are also discussed in
some special cases.

We present the following asymptotic results, which can be proved
under very general conditions as the size (or sizes) $N$ of the
training sample tends to $\infty$.

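Suppose $\delta_i(x; \hat\theta_N) \to \delta_i(x; \theta)$ in probability (a.s.) for almost all ($\mu$)
$x$, and write $\delta_N = \delta(\cdot\,; \hat\theta_N)$. Then

(I) $\alpha_{ic}(\delta_N; S_N, \theta) \to \alpha_i(\delta; \theta)$, in prob. (a.s.),

(II) $\alpha_i(\delta_N; \theta) \to \alpha_i(\delta; \theta)$,

(III) $R_c(\delta_N; S_N, \theta) \to R(\delta; \theta)$, in prob. (a.s.),

(IV) $R(\delta_N; \theta) \to R(\delta; \theta)$.

In the above results for convergences of risks, the condition "for
almost all $x$" may be relaxed to the condition "for almost all $x$ in the
set $\{x : \xi_1 f_1(x) \neq \xi_2 f_2(x)\}$."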

In particular, suppose $\delta_1(x; \theta) = 1$ if $D(x; \theta) > 0$, where $D(x; \theta)$ is
a function of $x$ when $\theta$ is known. Moreover, assume that $D(x; \theta)$ equals
zero on a null set (although this condition can be slightly relaxed).
If $D(x; \hat\theta_N) \to D(x; \theta)$ in probability (a.s.) for almost all $x$, the above
results on convergences hold. The primary requirements for the above
conditions to hold are that $D(x; \theta)$ is continuous in $\theta$ for almost all $x$,
and $\hat f_i(x) \to f_i(x)$ in prob. (a.s.), where $\hat f_i$ is an estimate of $f_i$.

The detailed proof of these results will be given elsewhere. For similar
but weaker results, see Johns [21] and Glick [16]. It may be noted
that the above results do not require $\delta$ to be a Bayes rule or $\hat f_i(x)$
to be integrable; these assumptions are used in almost all the papers in this
area. The above results can be used to simplify the proofs of many
known results.

The problem may also be handled from a decision-theoretic viewpoint
with provisions for "withholding decision" or "doubtful regions." See
Rao [32] for this approach when the distributions are known, and
Patrick's book [29] for some heuristic developments when the
distributions are unknown.


3.2. Rules based on estimates of density functions.

All the important papers deal with asymptotic properties of $\delta_N^* = \delta^*(\cdot\,; \hat\theta_N)$
with various estimates of $\theta$. Recall that one may define $\delta^*$ in either
of the following ways:

A. $\delta_1^*(x; \theta) = 1$, iff $D(x; \theta) \equiv \xi_1 f_1(x) - \xi_2 f_2(x) \geq 0$.

B. $\delta_1^*(x; \theta) = 1$, iff $D_1(x; \theta) \equiv \xi_1 f_1(x) / [\xi_1 f_1(x) + \xi_2 f_2(x)] \geq 1/2$.

The following methods for estimating the $f_i$'s are mostly used.

(i) Fix-Hodges' [12] method, later modified by Loftsgaarden and
Quesenberry [25].

(ii) Aizerman-Braverman-Rozonoer's method [1].

(iii) Cencov's method [6].

(iv) Parzen-Cacoullos' method [28,5].

(v) Wolverton-Wagner-Yamato's recursive method [51,52].


The known results on $\hat f_i$ are then used to derive asymptotics for $\delta_N^*$.
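A plug-in rule of form A with window-kernel density estimates, in the spirit of method (iv), might be sketched as follows. This is only an illustration for scalar observations: the Gaussian kernel, the window width $h = N^{-1/5}$, the use of sample proportions as the estimate of $\xi$, and all function names are assumptions of this example rather than choices made in the papers cited.

```python
import math

def parzen_density(sample, h):
    """Return x -> Gaussian-kernel density estimate with window width h."""
    n = len(sample)
    c = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return f_hat

def plug_in_rule(sample1, sample2, xi=None, h=None):
    """Build the plug-in rule: decide 1 iff xi_1 f1_hat(x) - xi_2 f2_hat(x) >= 0.

    sample1, sample2: identified training observations from classes 1 and 2.
    xi: prior vector; if None, it is estimated by the sample proportions
        (reasonable only when the training sample comes from the mixture).
    h:  window width; if None, an illustrative default N**(-1/5) is used.
    """
    n1, n2 = len(sample1), len(sample2)
    if xi is None:
        xi = (n1 / (n1 + n2), n2 / (n1 + n2))
    if h is None:
        h = (n1 + n2) ** (-0.2)
    f1_hat = parzen_density(sample1, h)
    f2_hat = parzen_density(sample2, h)

    def rule(x):
        d_hat = xi[0] * f1_hat(x) - xi[1] * f2_hat(x)  # estimate of D(x; theta)
        return 1 if d_hat >= 0 else 2

    return rule
```

When the two samples are drawn separately, the priors should be supplied through `xi`, since the sample proportions then carry no information about $\xi$.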

Fix and Hodges [12] essentially proved (II) with their suggested
estimators of the $f_i$'s, when the training sample is separately drawn from the
$\pi_i$'s and $\xi = (1/2, 1/2)$. Johns [21] proved the same result with a
minor modification of Fix-Hodges' estimates when the training sample is
identified and drawn from the mixed population. However, Johns considered
the problem with a general loss structure and a more general space of values
for the pattern-class indicator variable $I$. Van Ryzin [44] proved the
result III (in probability) with estimates (ii), (iii) and (iv), when the
training sample is separately drawn from the $\pi_i$'s. He also studied the
rates of these convergences. Van Ryzin [42] proved the result IV with
estimates similar to (iv) when the training sample is an identified sample
from the mixed population; he also obtained bounds on $R(\delta_N^*; \theta) - R(\delta^*; \theta)$
under additional conditions on $\theta$.

Another approach used in the literature is to estimate $D(x; \theta)$ or
$D_1(x; \theta)$ directly. Recursive decision rules are considered (for easier
updating), especially when the units to be identified occur in a sequence
from $\pi$ and the correct pattern-class of a unit is known after its
prediction. For classifying the $i$th unit, all the observations on the
previous units constitute a training sample. Suppose $\delta_N$ depends on
$D_{1N}(x; S_N)$ or $D_N(x; S_N)$ as $\delta^*$ depends on $D_1(x; \theta)$ or $D(x; \theta)$,
respectively; that is, $D_{1N}$ and $D_N$ are respective estimates of $D_1$ and
$D$. It is shown by Van Ryzin [43] that if

(V) $\int [D_{1N}(x; S_N) - D_1(x; \theta)]^2 \, \bar f(x) \, dx \to 0$

in probability, where $\bar f(x) = \xi_1 f_1(x) + \xi_2 f_2(x)$, then the result III
(in probability) holds. Van Ryzin [45] suggested a stochastic and
recursive algorithm for estimating $D_1(x; \theta)$ for which (V) holds under
fairly general conditions. His algorithm essentially involves window-kernels
for estimating the density functions. Van Ryzin's work was
inspired by the work of Aizerman et al. [1], who proved (V) with recursive
estimates using essentially method (ii) along with an additional
assumption that $D_1(x; \theta)$ is a finite linear combination of some known
orthonormal functions in $L_2$. For a generalization of the above work, see
Gyorfi [17]. Wolverton and Wagner [51] proved that

(VI) $\int [D_N(x; S_N) - D(x; \theta)]^2 \, dx \to 0$

in prob./a.s. implies the result III in prob./a.s., when the $f_i$'s are
uniformly continuous (on $R^m$). They suggested recursive estimates of
$D(x; \theta)$ for which (VI) holds in probability when the $f_i$'s are uniformly
continuous, and a.s. when, in addition, the $f_i$'s satisfy a uniform Lipschitz
condition. They also studied the rate of convergence when, specifically,
the $f_i$'s have bounded supports. A similar result on the rate of convergence in
probability was obtained by Rejto and Revesz [33]. Watanabe [49,50] proved (VI)
in prob. and a.s., along with the rates, using recursive estimates for
$D(x; \theta)$ following the method (v). A similar problem was studied by Tanaka
[41] when the training sample consists of dependent observations. Pelto
[30] suggested taking the width of the window-kernel for estimating
densities as the value which minimizes the deleted-counting estimate of
the risk; however, his paper does not give any algorithm and his proofs
are all based on heuristics.
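The appeal of recursive estimates is that each new observation updates the current estimate without revisiting the earlier data. A minimal sketch of a recursive window-kernel density estimate of the kind associated with method (v), $\hat f_n(x) = \frac{n-1}{n}\hat f_{n-1}(x) + \frac{1}{n h_n} K\!\left(\frac{x - X_n}{h_n}\right)$, is given below. The Gaussian kernel, the width sequence $h_n = n^{-1/5}$, the evaluation on a fixed grid of points, and the class name are assumptions of this illustration only.

```python
import math

class RecursiveDensity:
    """Recursive kernel density estimate evaluated on a fixed grid of points."""

    def __init__(self, grid, width=lambda n: n ** (-0.2)):
        self.grid = list(grid)                 # points x at which f_hat is tracked
        self.values = [0.0] * len(self.grid)   # current f_hat_n(x) on the grid
        self.n = 0                             # number of observations seen
        self.width = width                     # n -> h_n (illustrative default)

    def update(self, x_new):
        """Fold in one observation: f_n = (n-1)/n * f_{n-1} + K((x - X_n)/h_n)/(n h_n)."""
        self.n += 1
        n, h = self.n, self.width(self.n)
        for j, x in enumerate(self.grid):
            k = math.exp(-0.5 * ((x - x_new) / h) ** 2) / math.sqrt(2.0 * math.pi)
            self.values[j] = (n - 1) / n * self.values[j] + k / (n * h)
```

Keeping one such estimate per pattern-class and combining them as $\xi_1 \hat f_1 - \xi_2 \hat f_2$ gives a recursive estimate of $D(x; \theta)$ on the grid; that combination is again only one possible arrangement.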
It may be remarked that the rates of convergence derived in the
papers cited above only reflect the performance of the suggested rules
for future prediction, ignoring their performance in the past.

In passing, it may be noted that the estimate of the density function
by method (i) is not integrable; this estimate is further studied by
Moore and Henrichson [27] and Wagner [47]. See also a recent paper by
Wahba [48].

3.3. Use of the two-sample test procedures.

Let $(X_1, \ldots, X_{n_1})$ and $(Y_1, \ldots, Y_{n_2})$ be the observations on random
samples from $\pi_1$ and $\pi_2$ respectively. Let $(Z_1, \ldots, Z_n)$ be the
observations on the $n$ units to be classified. These units form a random
sample either from $\pi_1$ or from $\pi_2$. Let $F_0$ be the common c.d.f. of the
$Z_i$'s. Then the problem is to test $F_0 = F_1$ vs. $F_0 = F_2$.

Two-sample test statistics are often used to devise rules for the
above problem. The basic heuristic ideas can be described as follows.

(a) Use the Z's and X's to test $F_0 = F_1$ vs. $F_0 = F_2$; let $T_1$
be a test statistic such that large values of $T_1$ lead to the rejection
of $F_0 = F_1$. Similarly use the Z's and Y's to test $F_0 = F_2$ vs. $F_0 = F_1$;
let $T_2$ be a test statistic such that large values of $T_2$ lead to the
rejection of $F_0 = F_2$. Then define a rule which accepts $F_0 = F_1$ if
$T_1 < T_2$, and accepts $F_0 = F_2$ if $T_2 < T_1$. One may also compare the
critical levels of $T_1$ and $T_2$ instead of comparing $T_1$ and $T_2$ directly.

(b) Assume $F_0 = F_1$ and treat the Z's and X's as i.i.d. observations.
Get an estimate of the divergence between $F_1$ and $F_2$ by using a test
statistic for testing $F_1 = F_2$ vs. $F_1 \neq F_2$. Similarly assume the Z's and
Y's are i.i.d. observations and determine the corresponding estimate of
the divergence. A rule now can be devised by comparing these two estimates
of divergence.
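Idea (a) can be sketched with a Wilcoxon-Mann-Whitney type statistic. In the illustration below, $T_1$ is taken to be the absolute deviation from 1/2 of the proportion of pairs $(X_i, Z_j)$ with $X_i < Z_j$, which is close to 0 when the Z's and X's share a continuous distribution; $T_2$ is defined in the same way from the Y's. This particular two-sided statistic, the tie handling, and the function names are assumptions of the example, and it is not claimed to coincide with the rules studied in the papers cited in this section.

```python
def mann_whitney_fraction(train, z):
    """Proportion of pairs (t, z_j) with t < z_j, counting ties as 1/2."""
    count = 0.0
    for t in train:
        for zj in z:
            if t < zj:
                count += 1.0
            elif t == zj:
                count += 0.5
    return count / (len(train) * len(z))

def two_sample_rule(x, y, z):
    """Accept F0 = F1 if T1 < T2, F0 = F2 if T2 < T1; return 0 on a tie.

    x, y: training samples from populations 1 and 2.
    z:    observations on the units to be classified (same population).
    """
    t1 = abs(mann_whitney_fraction(x, z) - 0.5)  # large values reject F0 = F1
    t2 = abs(mann_whitney_fraction(y, z) - 0.5)  # large values reject F0 = F2
    if t1 < t2:
        return 1
    if t2 < t1:
        return 2
    return 0
```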

It is well known that a distribution-free rule cannot be derived for
the pattern recognition problem posed above. The performance of a rule
can only be judged in specific situations, except for some broad asymptotic
results.

Das Gupta [9] used the idea (a) and suggested a rule based on the Wilcoxon
statistic; he showed that such a rule is consistent (i.e., the error
probabilities tend to 0 as $n, n_1, n_2 \to \infty$). Hudimoto [20] also used the Wilcoxon
statistic when $F_1(x) \geq F_2(x)$ for all $x$, and derived some bounds for the
probabilities of error. Kinderman [23] used more or less the idea (b) in
deriving rules based on rank-scores when $F_1(x) = F_2(x + \delta)$, $\delta > 0$, and
studied the asymptotic efficiencies of those rules in Pitman's sense.

Chandra and Lee [7] suggested a rule which first uses the Wilcoxon test for
$F_1(x) > F_2(x)$ vs. $F_1(x) < F_2(x)$ based on the X's and Y's, and then
tests $F_0 = F_1$ vs. $F_0 = F_2$ using another Wilcoxon-type statistic
and the result of the first test. They studied the asymptotic properties of
this rule as $n_1$ and $n_2$ tend to $\infty$. It may be noted that this rule is
asymptotically equivalent to Das Gupta's rule [9]. See Chatterjee (J. Multiv.
Anal., 1973, 3, 26-56) for a related work.

A minimum distance rule can be defined following (a) above with $T_1$
and $T_2$ as distances between the respective empirical c.d.f.'s. Matusita
[26] obtained bounds on the probabilities of error of such a rule for the
discrete case with the Matusita distance. Das Gupta [9] proved the consistency
(as $n, n_1, n_2 \to \infty$) of such rules and derived bounds for the probabilities
of error using the Kolmogorov distance. No detailed studies on these rules are
yet available.

For simplicity, consider the minimum distance rule with the Kolmogorov
distance for $n = 1$. It can be shown that such a rule decides $F_0 = F_1$
if $|r_1/n_1 - 1/2| < |r_2/n_2 - 1/2|$, and $F_0 = F_2$ if $|r_1/n_1 - 1/2| > |r_2/n_2 - 1/2|$,
where $r_1$ = # of $X_i$'s $\leq Z_1$ and $r_2$ = # of $Y_i$'s $\leq Z_1$. We have studied the
probabilities of error of this rule, and the results will be reported
elsewhere. In particular, suppose $F_1(x) = G(x - \theta_1)$, $F_2(x) = G(x - \theta_2)$,
where $G$ is a continuous distribution, symmetric about 0. Then both
the conditional probabilities of error tend to $G(-|\theta_1 - \theta_2|/2)$ with
probability 1 as $n_1, n_2 \to \infty$.
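The single-observation rule above is simple enough to write down directly. The sketch below implements the comparison of $|r_1/n_1 - 1/2|$ with $|r_2/n_2 - 1/2|$ and, for the special case where $G$ is the standard normal c.d.f. (an assumption made only to have a concrete $G$), evaluates the limiting error probability $G(-|\theta_1 - \theta_2|/2)$. The function names and the convention of returning 0 on a tie are illustrative.

```python
import math

def min_distance_rule(x, y, z):
    """Kolmogorov minimum-distance rule for a single observation z.

    x, y: training samples from populations 1 and 2.
    Returns 1 or 2, or 0 if the two distances coincide.
    """
    r1 = sum(1 for v in x if v <= z)   # number of X's not exceeding z
    r2 = sum(1 for v in y if v <= z)   # number of Y's not exceeding z
    a1 = abs(r1 / len(x) - 0.5)
    a2 = abs(r2 / len(y) - 0.5)
    if a1 < a2:
        return 1
    if a1 > a2:
        return 2
    return 0

def limiting_error_normal(theta1, theta2):
    """G(-|theta1 - theta2| / 2) with G the standard normal c.d.f."""
    t = -abs(theta1 - theta2) / 2.0
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

The reduction to $|r_i/n_i - 1/2|$ reflects the fact that the Kolmogorov distance between an empirical c.d.f. and the one-point c.d.f. at $z$ is $\max(r_i/n_i, 1 - r_i/n_i) = 1/2 + |r_i/n_i - 1/2|$.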

3.4  Use  of  Tolerance  Regions. 

The use of tolerance regions (and statistically equivalent blocks) in
classification was suggested by Anderson [3]. The basic idea is closely
related to the problem of estimating a density function, and in that form
it appears in the work of Fix and Hodges [12]. Quesenberry and
Gessaman [31] suggested a method for constructing a rule based on tolerance
regions which is asymptotically optimal; however, their idea is not very
useful since the construction of such a rule depends on some inherent known
structures of the distributions. Later, Anderson and Henning [2] and
Beakley and Tuteur [4] suggested some other heuristic models. So far no
theoretical results are available on these rules. This is due to the fact
that very little is known about the performance of a tolerance region under a
different distribution. Gessaman and Gessaman [14] studied some of these
rules by the Monte Carlo method.

3.5  Empirical  Best-of-Class  Rules. 

The basic idea can be described as follows: Consider a class $\mathcal{D}$ of
rules, and let $C(\delta)$ be the proportion of correct identifications when
$\delta \in \mathcal{D}$ is applied to identify the observations in the training sample. Let
$\delta_N \in \mathcal{D}$ be a rule which maximizes $C(\delta)$ in $\mathcal{D}$. Then $\delta_N$ is called an
empirical best $\mathcal{D}$-class rule. Let $\delta^* \in \mathcal{D}$ be a rule which maximizes the
probability of correct classification in the class $\mathcal{D}$.

Stoller [40] proved the convergence (in prob.) of the conditional
risk of $\delta_N$ to the risk of $\delta^*$ in the univariate case when $\mathcal{D}$ is the
class of rules determined by single cutoff points. Glick studied the
convergence (a.s.) of $C(\delta_N)$ to the probability of correct classification
of $\delta^*$, with special emphasis on the class of linear rules. The other
papers in this area can be found in Duda and Hart (Pattern Classification and
Scene Analysis, 1973, Wiley), although these are not of much statistical interest.

General results regarding the existence of $\delta_N$ and algorithms to
determine $\delta_N$ are not yet available (except in the case considered by
Stoller). General asymptotic results easily follow from the known
asymptotic properties of empirical c.d.f.'s.
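For the univariate class of single-cutoff rules, an empirical best-of-class rule can be found by exhaustive search over the observed values. The sketch below considers rules of the form "assign the class `low_class` when $x \leq c$, and the other class otherwise", and returns one maximizer of the proportion of correct identifications on the training sample. The search over observed values only, the tie-breaking (first maximizer found), and the function name are assumptions of this illustration, not a description of Stoller's procedure.

```python
def best_cutoff_rule(xs, labels):
    """Return (cutoff, low_class, proportion_correct) maximizing C(delta).

    xs:     scalar training observations.
    labels: their pattern classes (1 or 2).
    A rule assigns low_class to x <= cutoff and the other class to x > cutoff.
    """
    pts = sorted(set(xs))
    # A cutoff below the smallest observation gives the rules that assign
    # every observation to a single class.
    candidates = [pts[0] - 1.0] + pts
    n = len(xs)
    best = None
    for c in candidates:
        for low in (1, 2):
            high = 3 - low
            correct = sum(1 for x, lab in zip(xs, labels)
                          if (low if x <= c else high) == lab)
            if best is None or correct / n > best[2]:
                best = (c, low, correct / n)
    return best
```

The search costs on the order of $N^2$ comparisons; sorting once and updating the counts incrementally would reduce this, at the price of a slightly longer listing.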

We suggest the following rule in the multivariate case, which is easy
to apply. First treat the problem separately for each variate. However,
this will lead to some inconsistent decisions or indecision zones. Each of
these zones can then be treated successively and separately by each variate
to successively reduce the number of such indecision zones. It is believed
that this rule will be asymptotically optimal for a large number of classes.

3.6  Rank-Distance  Rules. 


The basic idea is to find distances of the observations in the training
sample from the observation $X$ to be classified and construct rules based on
the ranks of these distances and the corresponding pattern-class numbers.

Fix and Hodges [12] suggested the following rule $\delta_N^{(1)}$, termed the
1-NN rule: Classify $X$ to the pattern-class of the nearest neighbor (NN)
of $X$. It can be shown (possibly given in [12]) that under mild conditions
the probability of misclassification satisfies

$$\alpha_i(\delta_N^{(1)}; \theta) \to \int \frac{p_j \, f_1(x) f_2(x)}{p_1 f_1(x) + p_2 f_2(x)} \, d\mu(x) \quad (i \neq j),$$

where $p_i = \lim n_i/(n_1 + n_2)$, as $n_1, n_2$ (the sample sizes from $\pi_1$ and $\pi_2$)
tend to $\infty$. For the mixed population case, Cover and Hart [8] have shown
that

$$R(\delta_N^{(1)}; \theta) \to \int \frac{2 \xi_1 \xi_2 f_1(x) f_2(x)}{\xi_1 f_1(x) + \xi_2 f_2(x)} \, dx \equiv R_0$$

as $N \to \infty$, when the sample space of each $X$ is $R^m$ (or slightly more
general). Wagner [46] has shown that under very mild conditions the
conditional risk of $\delta_N^{(1)}$ tends to $R_0$ in probability (and a.s. under
additional restrictions).

Fix and Hodges [12] also suggested the K-NN rule $\delta_N^{(K)}$, which can be
described as follows. Let $K_i$ be the number of observations in the
training sample with the pattern class $i$ that belong to the $K$ nearest
neighbors of $X$. Then the rule decides the pattern-class of $X$ as 1
if $K_1/n_1 > K_2/n_2$, where $n_i$ is the number of observations in the
training sample with the pattern-class $i$. One may also consider a rule by
comparing $K_1$ and $K_2$. Johns [21] stated that under mild conditions the
risk of the latter rule tends to the risk of $\delta^*$ (Bayes), when $K \to \infty$ and
$K/N \to 0$. The convergence of the conditional risk of this rule to the
risk of $\delta^*$ (in different modes) can be obtained from the results in 3.1.
Some other theoretical results on $\delta_N^{(K)}$ are claimed in the literature,
although they were not proved with rigor.

It may be noted that NN rules are also related to the rules based on
estimates of density functions. All the papers in this area deal only
with the problem of classifying one unit.
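Both versions of the nearest-neighbor rule described in this section fit in a few lines. The sketch below handles scalar observations with absolute difference as the distance; the restriction to scalars, the function name, the `compare_rates` flag, and the convention of returning 0 when the two counts (or rates) are equal are assumptions of this illustration.

```python
def knn_rule(train, x, k=1, compare_rates=False):
    """K-nearest-neighbor rule for two pattern classes.

    train: list of (value, label) pairs with labels 1 or 2; both classes
           must be present when compare_rates is True.
    x:     observation to be classified.
    compare_rates=False compares K_1 with K_2;
    compare_rates=True  compares K_1/n_1 with K_2/n_2.
    Returns 1 or 2, or 0 if the comparison is tied.
    """
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    k1 = sum(1 for _, lab in nearest if lab == 1)
    k2 = len(nearest) - k1
    if compare_rates:
        n1 = sum(1 for _, lab in train if lab == 1)
        n2 = len(train) - n1
        s1, s2 = k1 / n1, k2 / n2
    else:
        s1, s2 = k1, k2
    if s1 > s2:
        return 1
    if s2 > s1:
        return 2
    return 0
```

With `k=1` this is the 1-NN rule; for vector observations one would replace the absolute difference by a metric on $R^m$.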

3.7  Conclusion. 

Many nonparametric rules are suggested in the literature from heuristic
viewpoints. Asymptotic properties of most of these rules are not difficult
to obtain, although good studies on the rates of convergence and asymptotic
expansions of risks would be more useful. The usefulness of a rule is
determined by its simplicity as well as by its robustness. Studies on
robustness and small-sample behavior of these rules are quite limited. The
relative comparisons (finite-sample or asymptotic) of some of the popular
rules in specific situations are also called for.

4. Sequential Rules.

Sequential pattern recognition may involve sequential experimentation
and sequential decision rules. A sequential experiment may arise in any
combination of the following three situations. (a) Selection of components
of the measurement vector on each unit in the training sample, as well as
in the sample of units to be classified. (b) Selection of the sample size
of the training sample. (c) Selection of the sample size of the units to
be classified when all the units are known to belong to one population. The
basic object in using a sequential rule is to attain prescribed probabilities
of error and to reduce the average sample size; in the Bayes formulation
(involving probabilities of error and cost of observations) the object
is to reduce the Bayes risk. When the pattern-class densities are known,
a Bayes rule can easily be derived following Wald's formulation; however,
when the densities are unknown the problem is too involved and no
satisfactory results are yet available.

Following the ideas of Hoeffding and Wolfowitz [19], Das Gupta
and Kinderman [11] introduced an important notion termed
"classifiability" for the situations (b) and (c) above. Consider three independent
random vectors (of the same dimension) $X_0$, $X_1$ and $X_2$, where the c.d.f. of
$X_i$ is $F_i$. It is known that $F_0$ is equal to either $F_1$ or $F_2$, and that
$F_1 \times F_2$ lies in a given set $\Omega$. The problem is to decide $F_0 = F_1$ or
$F_0 = F_2$ based on a sequence of observations on $(X_0, X_1, X_2)$. This problem
is said to be sequentially (finitely) classifiable if for every $\alpha$ ($0 < \alpha < 1$)
there exists a sequential rule (finite sample size rule) which terminates with
probability 1 such that the probabilities of error of that rule are uniformly
(in $\Omega$) less than $\alpha$. Necessary and sufficient conditions for sequential
and finite classifiability are given by Das Gupta and Kinderman [11].

With the object of controlling the error probabilities uniformly and
arbitrarily, it is also important to find out whether it is necessary and
sufficient to get observations only on $X_1$ (or on $X_2$) or on both $X_1$ and $X_2$.
This problem is also analysed in Das Gupta and Kinderman [11]. In
particular, suppose $F_1 = N_p(\mu_1, \Sigma)$, $F_2 = N_p(\mu_2, \Sigma)$. The problem is
sequentially classifiable based on observations on $(X_0, X_1, X_2)$ if, and only
if, $\mu_1 \neq \mu_2$. The problem is sequentially classifiable or finitely
classifiable based on observations on $(X_0, X_1)$ if inf ...
respectively.




Following Hoeffding and Wolfowitz [19] we present a "minimum
distance" sequential rule. Consider $n$ observations on $(X_0, X_1, X_2)$, and
let $F_i^{(n)}$ be the empirical c.d.f. based on $n$ observations on
$X_i$ ($i = 0, 1, 2$). Let $\Omega \subset \mathcal{F} \times \mathcal{F}$, where $\mathcal{F}$ is a class of distributions on
the space of $X_i$, and let $d$ be a uniformly consistent distance function
defined on $\mathcal{F} \times \mathcal{F}$. (We also assume that $d$ is defined on empirical
c.d.f.'s.) Assume that

$$\{F_1 \times F_2 \in \Omega : d(F_1, F_2) = 0\}$$

is null. Let $\{c_i\}$ be a sequence of positive constants decreasing to 0,
and $\{n_i\}$ be a sequence of strictly increasing positive integers. Now we
define the rule as follows. (See [23].)

Take samples of sizes $n_1, n_2 - n_1, \ldots$, until $D_{n_i} > c_i$, where
$D_n = \max\{d(F_0^{(n)}, F_1^{(n)}), \, d(F_0^{(n)}, F_2^{(n)})\}$. Setting $N = n_i$, make the
terminal decision as follows. Decide $F_0 = F_1$ iff $d(F_0^{(N)}, F_1^{(N)}) < d(F_0^{(N)}, F_2^{(N)})$.

Given $\alpha$, the sequences $\{n_i\}$ and $\{c_i\}$ can be chosen such that
$P(N < \infty) = 1$ and the probabilities of error are less than $\alpha$. If $d$ is the
Kolmogorov distance then $EN < \infty$ besides the above. However, the
distribution of $N$ needs further study. For the two-population problem, Kurz and
Woinsky [24] suggested a nonparametric sequential rule based on the Wilcoxon
statistic considering the situations (b) and (c), when $\int F_2(x) \, dF_1(x) < 1/2$
or $F_2(x) = F_1(x - \delta)$ with known $\delta$. They considered asymptotic properties
of their rules as the maximum of the error probabilities (denoted by $\alpha$) tends
to 0. Following the technique of Chow-Robbins they proved the asymptotic
efficiency of the sample size, although the result is not very meaningful.
Moreover, they proved that the difference between the maximum of the error
probabilities of their rule and $\alpha$ tends to 0 as $\alpha$ tends to 0; such a
result is almost trivial and throws no light at all on the performance of
their rule. One should consider the limit of the ratio of the two above
instead of their difference.
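A staged minimum-distance sequential rule of the kind presented earlier in this section can be sketched as follows, using the Kolmogorov distance between empirical c.d.f.'s. The stopping statistic is assumed here to be the larger of the two distances $d(F_0^{(n)}, F_1^{(n)})$ and $d(F_0^{(n)}, F_2^{(n)})$; that reading, the scalar observations, the function names, and the particular stage sizes and thresholds a caller passes in are all assumptions of this example. No attempt is made to choose the sequences so as to guarantee a prescribed error level.

```python
import bisect

def kolmogorov_distance(a, b):
    """sup_x |F_a(x) - F_b(x)| for the empirical c.d.f.'s of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    best = 0.0
    for t in sa + sb:  # the supremum is attained at a sample point
        fa = bisect.bisect_right(sa, t) / len(sa)
        fb = bisect.bisect_right(sb, t) / len(sb)
        best = max(best, abs(fa - fb))
    return best

def sequential_min_distance(draw, sizes, thresholds):
    """Staged minimum-distance rule.

    draw():     returns one joint observation (x0, x1, x2).
    sizes:      strictly increasing stage sizes n_1 < n_2 < ...
    thresholds: positive constants c_1 > c_2 > ... decreasing toward 0.
    Returns (decision, N), with decision 1 or 2, or (None, n) if the
    supplied stages are exhausted without stopping.
    """
    x0, x1, x2 = [], [], []
    for n_i, c_i in zip(sizes, thresholds):
        while len(x0) < n_i:          # sample up to the current stage size
            a, b, c = draw()
            x0.append(a)
            x1.append(b)
            x2.append(c)
        d1 = kolmogorov_distance(x0, x1)
        d2 = kolmogorov_distance(x0, x2)
        if max(d1, d2) > c_i:         # assumed stopping criterion
            return (1 if d1 < d2 else 2), n_i
    return None, len(x0)
```

In a simulation, `draw` would typically generate $X_0$ from the same distribution as $X_1$ (or $X_2$), which makes it easy to estimate the error probabilities and the distribution of $N$ empirically.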

Srivastava [39] proposed sequential rules when $F_1 = N_p(\mu_1, \Sigma)$ and
$F_2 = N_p(\mu_2, \Sigma)$ in the following two cases: (i) $\Sigma$ is unknown;
(ii) both $\mu_1 - \mu_2$ and $\Sigma$ are unknown. Given $\alpha$, he constructed a sequential
rule for the case (i) such that the error probabilities of the rule are less
than $\alpha$ and its sample size is asymptotically efficient (in comparison to
the sample size when the parameters are known, and as $\alpha \to 0$).
For the case (ii), he showed that the limits of the error probabilities of
his rule are $\leq \alpha$. Srivastava followed the ideas of Chow-Robbins and
Simons [37]; however, his proof for the case (i) is incomplete and it is
wrong for the case (ii). The main error lies in the fact that the notion
of a.s. convergence as $\alpha \to 0$ is not well defined.

For the situation (a) described above, there are many apparently good
results available in the literature, especially in Fu's book [13].
Unfortunately, most of the results are blind copies of the two-sample
sequential rules and they lack sufficient rigor, as well as meaningful
formulation. Some heuristic rules, as in Smith and Yau [38], may be
studied further with proper rigor.

Practically very few interesting results are available in the study of
sequential rules. The problem requires first a meaningful formulation and a
useful definition of asymptotic efficiency. For example, in Srivastava's
case (ii) no asymptotically efficient sequential rule would exist and one may
then focus on the loss of efficiency. It seems that the problem may well
be studied from Chernoff's viewpoint after introducing sampling cost.